FaceLit: Neural 3D Relightable Faces
We propose a generative framework, FaceLit, capable of generating a 3D face
that can be rendered at various user-defined lighting conditions and views,
learned purely from 2D images in-the-wild without any manual annotation. Unlike
existing works that require careful capture setup or human labor, we rely on
off-the-shelf pose and illumination estimators. With these estimates, we
incorporate the Phong reflectance model in the neural volume rendering
framework. Our model learns to generate shape and material properties of a face
such that, when rendered according to the natural statistics of pose and
illumination, it produces photorealistic face images with multiview 3D and
illumination consistency. Our method enables photorealistic generation of faces
with explicit illumination and view controls on multiple datasets: FFHQ,
MetFaces, and CelebA-HQ. We show state-of-the-art photorealism among 3D-aware
GANs on the FFHQ dataset, achieving an FID score of 3.5.
Comment: CVPR 202
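The abstract embeds the classic Phong reflectance model inside neural volume rendering. As a reminder of what that model computes per surface point, here is a minimal NumPy sketch of Phong shading; the coefficient names and values are illustrative defaults, not taken from the paper:

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def phong_radiance(normal, light_dir, view_dir,
                   k_ambient=0.1, k_diffuse=0.6, k_specular=0.3, shininess=16):
    """Scalar radiance at a surface point under one white light, following
    the classic Phong model: ambient + diffuse + specular terms."""
    n = normalize(normal)
    l = normalize(light_dir)
    v = normalize(view_dir)
    diffuse = max(float(n @ l), 0.0)
    # Reflection of the light direction about the surface normal.
    r = 2.0 * float(n @ l) * n - l
    specular = max(float(normalize(r) @ v), 0.0) ** shininess
    return k_ambient + k_diffuse * diffuse + k_specular * specular
```

With the light and viewer both along the normal, all three terms contribute; with the light behind the surface, only the ambient term remains.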
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Adapting generic speech recognition models to specific individuals is a
challenging problem due to the scarcity of personalized data. Recent works have
proposed boosting the amount of training data using personalized text-to-speech
synthesis. Here, we ask two fundamental questions about this strategy: when is
synthetic data effective for personalization, and why is it effective in those
cases? To address the first question, we adapt a state-of-the-art automatic
speech recognition (ASR) model to target speakers from four benchmark datasets
representative of different speaker types. We show that ASR personalization
with synthetic data is effective in all cases, but particularly when (i) the
target speaker is underrepresented in the global data, and (ii) the capacity of
the global model is limited. To address the second question of why personalized
synthetic data is effective, we use controllable speech synthesis to generate
speech with varied styles and content. Surprisingly, we find that the text
content of the synthetic data, rather than style, is important for speaker
adaptation. These results lead us to propose a data selection strategy for ASR
personalization based on speech content.
Comment: ICASSP 202
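The abstract's finding that text content matters more than speaking style suggests selecting synthesis texts the global model has seen least. One simple way to operationalize that idea, sketched here as an assumption rather than the paper's actual selection strategy, is to rank candidate texts by average word rarity against the global training corpus:

```python
import math
from collections import Counter

def select_texts_by_content(candidates, global_corpus, k=2):
    """Rank candidate synthesis texts by average word rarity relative to a
    global training corpus and keep the top-k, as a crude proxy for picking
    content the base model has seen least often."""
    counts = Counter(w for sent in global_corpus for w in sent.lower().split())
    total = sum(counts.values())
    vocab = len(counts)

    def rarity(sent):
        words = sent.lower().split()
        # Laplace-smoothed negative log-frequency, averaged over words.
        return sum(-math.log((counts[w] + 1) / (total + vocab))
                   for w in words) / len(words)

    return sorted(candidates, key=rarity, reverse=True)[:k]
```

Texts made of words unseen in the global corpus score highest and are kept for synthesis.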
Novel-View Acoustic Synthesis from 3D Reconstructed Rooms
We investigate the benefit of combining blind audio recordings with 3D scene
information for novel-view acoustic synthesis. Given audio recordings from 2-4
microphones and the 3D geometry and material of a scene containing multiple
unknown sound sources, we estimate the sound anywhere in the scene. We identify
the main challenges of novel-view acoustic synthesis as sound source
localization, separation, and dereverberation. While naively training an
end-to-end network fails to produce high-quality results, we show that
incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms
enables the same network to jointly tackle these tasks. Our method outperforms
existing methods designed for the individual tasks, demonstrating its
effectiveness at utilizing 3D visual information. In a simulated study on the
Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source
localization, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation
and dereverberation, resulting in a PSNR of 25.55 dB and an SDR of 14.20 dB on
novel-view acoustic synthesis. Code, pretrained model, and video results are
available on the project webpage (https://github.com/apple/ml-nvas3d).
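The PSNR and SDR figures quoted above follow standard definitions: both are log-ratios of signal energy to error energy, with PSNR referenced to a peak value. A minimal sketch (the peak value of 1.0 is an assumption about signal scaling, not stated in the abstract):

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming signals scaled to [-peak, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))
```

For a constant signal estimated with 10% error, both metrics evaluate to 20 dB.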
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
While Automatic Speech Recognition (ASR) systems are widely used in many
real-world applications, they often do not generalize well to new domains and
need to be finetuned on data from these domains. However, target-domain data
is usually not readily available. In this paper, we propose
a new strategy for adapting ASR models to new target domains without any text
or speech from those domains. To accomplish this, we propose a novel data
synthesis pipeline that uses a Large Language Model (LLM) to generate a target
domain text corpus, and a state-of-the-art controllable speech synthesis model
to generate the corresponding speech. We propose a simple yet effective
in-context instruction finetuning strategy to increase the effectiveness of the LLM
in generating text corpora for new domains. Experiments on the SLURP dataset
show that the proposed method achieves an average relative word error rate
improvement on unseen target domains without any performance drop in
source domains.
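The pipeline above prompts an LLM with in-context examples to produce target-domain text. The exact prompt format is not given in the abstract, so the sketch below is a hypothetical few-shot prompt builder, with wording and structure chosen purely for illustration:

```python
def build_corpus_prompt(domain, example_utterances, n_new=5):
    """Assemble a few-shot prompt asking an LLM for in-domain utterances.
    The phrasing and layout here are hypothetical, not the paper's prompt."""
    lines = [f"Generate {n_new} short user utterances for the '{domain}' domain "
             "of a voice assistant, one per line."]
    if example_utterances:
        lines.append("Examples:")
        lines.extend(f"- {u}" for u in example_utterances)
    lines.append("New utterances:")
    return "\n".join(lines)
```

The returned string would be sent to the LLM, and each generated line then fed to the controllable TTS model to produce paired speech for ASR finetuning.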
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
Controllable generative sequence models with the capability to extract and
replicate the style of specific examples enable many applications, including
narrating audiobooks in different voices, auto-completing and auto-correcting
written handwriting, and generating missing training samples for downstream
recognition tasks. However, under an unsupervised-style setting, typical
training algorithms for controllable sequence generative models suffer from the
training-inference mismatch, where the same sample is used as content and style
input during training but unpaired samples are given during inference. In this
paper, we tackle the training-inference mismatch encountered during
unsupervised learning of controllable generative sequence models. The proposed
method is simple yet effective, where we use a style transformation module to
transfer target style information into an unrelated style input. This method
enables training using unpaired content and style samples, thereby mitigating
the training-inference mismatch. We apply style equalization to text-to-speech
and text-to-handwriting synthesis on three datasets. We conduct thorough
evaluation, including both quantitative and qualitative user studies. Our
results show that by mitigating the training-inference mismatch with the
proposed style equalization, we achieve style replication scores comparable to
real data in our user studies.
Comment: ICML 202
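The core idea is injecting a target example's style into an unrelated style input. One common way to transfer style statistics between feature sequences is a per-channel mean/variance swap; the sketch below uses that familiar operation purely as an analogy, and is not the paper's style transformation module:

```python
import numpy as np

def equalize_style(content_feats, style_feats, eps=1e-6):
    """Replace the per-channel mean/std of content features (time x channels)
    with those of a style example. An AdaIN-like statistic swap, shown only
    as an illustrative analogy for injecting target style information."""
    c_mean, c_std = content_feats.mean(axis=0), content_feats.std(axis=0)
    s_mean, s_std = style_feats.mean(axis=0), style_feats.std(axis=0)
    # Normalize away the content sequence's own statistics...
    normalized = (content_feats - c_mean) / (c_std + eps)
    # ...then impose the style sequence's statistics.
    return normalized * s_std + s_mean
```

After the swap, the output carries the content sequence's temporal structure but the style example's per-channel statistics, which is the kind of decoupling of content and style that makes unpaired training possible.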
Meta-analysis identifies common variants associated with body mass index in east Asians
DOI: 10.1038/ng.1087. Nature Genetics 44(3), 307-311.